Fall back to nogds at runtime when cuFile handle registration fails #87
Open
gitbisector wants to merge 1 commit into
Collaborator
Thanks for the contribution. The change looks good to me, but please add a Signed-off-by line to your commit. I will then merge this after change #81.
e5b8794 to
96ff25f
Compare
cuFile can probe as available yet fail cuFileHandleRegister at I/O time (compat-mode hosts without nvidia-fs, checkpoints on overlayfs, CI runners). Catch the failure in submit_io, warn once, and transparently delegate the copier to the nogds bounce path -- so every consumer stops carrying its own gds->nogds retry wrapper. The fallback (and its bounce-buffer reader) lives only for the file's submit/wait cycle.

Signed-off-by: git bisector <gitbisector@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
96ff25f to
7775b5e
Compare
cuFile can probe as available (`is_gds_supported()` / `is_cufile_found()` pass) yet fail `cuFileHandleRegister` at I/O time: compat-mode hosts without the nvidia-fs kernel module, checkpoints on filesystems cuFile can't register (e.g. overlayfs on CI runners), etc. Today that surfaces as a hard `RuntimeError: raw_gds_file_handle: cuFileHandleRegister returned an error = 5027` from `GdsFileCopier.submit_io`, and every consumer ends up carrying its own gds→nogds retry wrapper; e.g. vllm-project/vllm#40183 needed exactly that fix when its CI runner hit 5027 (weights on overlayfs, probe says GDS is fine).

This catches the registration failure inside `GdsFileCopier.submit_io`, warns once, and transparently delegates the copier to the nogds bounce path. The fallback copier (and its bounce-buffer reader) lives only for that file's submit/wait cycle, so no pinned memory outlives the load.

Test: monkeypatched `gds_file_handle` raising the 5027 error → load completes via the fallback, tensors byte-identical to `safetensors.load_file`, and the bounce buffer is released afterwards. `make lint` clean; unit suite green on CPU and on a CUDA host where cuFile genuinely fails registration (the fallback turns that previously-failing environment green).
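
For reviewers who want the shape of the fallback without reading the diff, here is a minimal, library-free sketch of the pattern described above. It is not the code in this PR: `DirectCopier`, `BounceCopier`, `RegistrationError`, and `failing_register` are hypothetical stand-ins, and plain `bytes` replace GPU tensors and pinned host memory. It only illustrates three things: catching the registration failure inside `submit_io`, warning once, and dropping the fallback (with its bounce buffer) once the submit/wait cycle ends.

```python
import warnings


class RegistrationError(RuntimeError):
    """Hypothetical stand-in for the cuFileHandleRegister failure (error 5027)."""


class BounceCopier:
    """Hypothetical stand-in for the nogds bounce-buffer path."""

    def __init__(self):
        self.buffer = None

    def submit_io(self, data: bytes):
        # Stage the data through a host-side buffer.
        self.buffer = bytearray(data)
        return self.buffer

    def wait_io(self, handle) -> bytes:
        out = bytes(handle)
        # Release the bounce buffer as soon as the copy is complete.
        self.buffer = None
        return out


class DirectCopier:
    """Hypothetical stand-in for the GDS path, with a runtime fallback."""

    _warned = False  # class-level flag: warn only once per process

    def __init__(self, register):
        self._register = register  # callable that may raise RegistrationError
        self._fallback = None

    def submit_io(self, data: bytes):
        try:
            self._register()
        except RegistrationError as exc:
            if not DirectCopier._warned:
                warnings.warn(f"direct path unavailable, falling back: {exc}")
                DirectCopier._warned = True
            # Delegate this file's I/O to the bounce path.
            self._fallback = BounceCopier()
            return self._fallback.submit_io(data)
        return data  # direct path: no staging buffer

    def wait_io(self, handle) -> bytes:
        if self._fallback is not None:
            # The fallback lives only for this submit/wait cycle.
            fallback, self._fallback = self._fallback, None
            return fallback.wait_io(handle)
        return bytes(handle)


def failing_register():
    """Simulates a host where the probe passes but registration fails."""
    raise RegistrationError("cuFileHandleRegister returned an error = 5027")
```

A caller uses the copier the same way in both cases: `copier = DirectCopier(failing_register)` followed by `copier.wait_io(copier.submit_io(payload))` returns the payload unchanged, which is the point of moving the retry out of each consumer and into the copier.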